Chinese AI Model Trained on Synthetic Data Using Nvidia Chips Outperforms Larger Models
A breakthrough in artificial intelligence demonstrates the potential of synthetic data. Researchers from Tsinghua University, Microsoft Research Asia, and Wuhan University developed X-Coder, a 7-billion-parameter AI model trained entirely on artificially generated data using Nvidia's H20 and H200 chips. The model outperformed coding systems twice its size that were trained on human-generated data.
The team leveraged Nvidia's export-controlled hardware, using 128 H20 chips for 220 hours of supervised fine-tuning followed by 32 H200 chips for seven days of reinforcement learning. This hardware pairing, combining the inference-optimized H20 with the training-focused H200, proved critical given current US export restrictions.
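The team's training code is not reproduced in the article, but the two-stage recipe it describes, supervised fine-tuning on synthetic examples followed by a reinforcement-learning phase, has a standard structure. The toy sketch below illustrates that structure only; the model, data, reward function, and REINFORCE-style update are placeholders, not the researchers' actual method.

```python
# Illustrative sketch of an SFT-then-RL pipeline. Everything here is a
# placeholder standing in for the full-scale system described in the article.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 1000, 64
# Tiny stand-in for a 7-billion-parameter language model.
model = nn.Sequential(nn.Embedding(VOCAB, DIM), nn.Linear(DIM, VOCAB))

def sft_step(tokens, optimizer):
    """Stage 1: supervised fine-tuning via next-token cross-entropy on synthetic data."""
    logits = model(tokens[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def rl_step(prompt, reward_fn, optimizer):
    """Stage 2: a REINFORCE-style update; production systems typically use
    PPO/GRPO-class methods with rewards such as unit tests passing."""
    logits = model(prompt)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()                      # sampled completion tokens
    reward = reward_fn(actions)                  # scalar reward per sequence
    loss = -(dist.log_prob(actions).sum(-1) * reward).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
synthetic_batch = torch.randint(0, VOCAB, (8, 32))   # placeholder synthetic corpus
sft_step(synthetic_batch, opt)
rl_step(torch.randint(0, VOCAB, (8, 16)), lambda a: torch.rand(a.shape[0]), opt)
```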
Findings published on arXiv reveal that synthetic data follows established scaling laws, challenging conventional wisdom about training data requirements. The success of the team's SynthSmith data-generation pipeline suggests that compute power, not data sourcing, may become the primary constraint in AI development.
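For context, a "scaling law" here is the empirical power-law relationship between a model's loss and its size and training-data volume. The coefficients fitted in the new paper are not reproduced here, but the commonly cited functional form (the Chinchilla-style law of Hoffmann et al.) is:

```latex
L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

where N is the parameter count, D is the number of training tokens, and E, A, B, alpha, and beta are fitted constants. The reported finding is that the same kind of relationship continues to hold when D consists of synthetic rather than human-generated tokens.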